- Cheap HTML parser
-
- Jim Davis
-
- davis@dri.cornell.edu
-
- July 1994
-
-
-
- This is code for doing simple processing on HTML. I know there are bugs and
-
- limitations in the code, but it suffices for simple purposes. Among the
-
- limitations: this is an HTML parser, not an SGML parser. It does not accept
-
- a DTD; instead, the model of HTML is built into the code. It also does not
-
- validate the HTML: it will attempt to parse invalid documents, and the
-
- results are undefined if the document is in error.
-
-
-
- The source code is available as a compressed Unix tar file. It runs under
-
- perl 4.0 patch level 36. I don't know about other versions of perl. This
-
- directory contains:
-
-
-
- parse-html.pl
-
- A simple HTML parser written in perl. As it parses the HTML, it calls
-
- routines (which you may redefine) for each tag encountered, and for
-
- whitespace and content. You can redefine these routines so as to
-
- process the HTML document.
-
- html-to-ascii.pl
-
- Uses the HTML parser to generate a plain ASCII version of an HTML
-
- document.
-
- html-ascii.pl
-
- The actual routines to generate the ASCII.
-
- tformat.pl
-
- A low-level text formatter used for generating ASCII. More or less like
-
- a subset of nroff.
-
- html-to-rfc.pl
-
- Uses the HTML parser to generate a plain ASCII version of an HTML document,
-
- with special formatting requirements for Internet drafts and RFCs.
-
- rfc.pl
-
- Additional routines required for RFC formatting (e.g. page headers and
-
- footers)
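
The callback design described for parse-html.pl and html-to-ascii.pl can be illustrated with a short sketch. This is not the Perl code from the distribution, and the routine names in parse-html.pl are not listed here. As an assumption for illustration only, the sketch uses Python's standard html.parser, which follows the same pattern: the parser calls a method for each start tag, end tag, and run of content, and you override those methods to process the document. The class AsciiRenderer, the function html_to_ascii, and the choice of block-level tags are hypothetical names made up for this example.

```python
from html.parser import HTMLParser
import textwrap


class AsciiRenderer(HTMLParser):
    """Collects text from an HTML document and emits wrapped plain ASCII."""

    # Hypothetical choice: tags treated as paragraph boundaries in this sketch.
    BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "br"}

    def __init__(self, width=72):
        super().__init__()
        self.width = width
        self.paragraphs = []   # finished paragraphs
        self.current = []      # text fragments of the paragraph in progress

    def _flush(self):
        # Collapse runs of whitespace, as a fill-mode formatter would.
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser for every start tag.
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        # Called by the parser for every end tag.
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Called by the parser for content and whitespace between tags.
        self.current.append(data)

    def render(self):
        self._flush()
        return "\n\n".join(textwrap.fill(p, self.width) for p in self.paragraphs)


def html_to_ascii(html, width=72):
    """Convert an HTML string to wrapped plain text."""
    renderer = AsciiRenderer(width)
    renderer.feed(html)
    renderer.close()
    return renderer.render()
```

Calling html_to_ascii on a document string returns the text with each block-level element as its own paragraph, filled to the requested width. Inline tags such as EM are ignored, so their content stays in the surrounding paragraph.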
-
-
-
- Generating RFCs from HTML
-
-
-
- The RFC format requires a header and footer containing, among other
-
- things, the names of the authors and a short title. You specify values
-
- for these fields with META tags, as in the following example:
-
-
-
- <META name="status" content="Internet Draft">
-
- <META name="title" content="Internet audio protocol">
-
- <META name="date" content="July 1983">
-
- <META name="author" content="Nixon, Haldeman">
-
-
-
- (The META tag is not officially part of HTML; it was proposed by Roy
-
- Fielding.) The tags should be in the HEAD.
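
As a rough illustration of how a formatter might pick up these fields, the sketch below collects name/content pairs from META tags that appear inside the HEAD. It is not the code in rfc.pl. It uses Python's standard html.parser as a stand-in, and the names MetaCollector and rfc_fields are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <META name=... content=...> pairs that appear inside HEAD."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name is not None and content is not None:
                self.fields[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def rfc_fields(html):
    """Return a dict mapping META names found in the HEAD to their content."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.fields
```

For the example above, rfc_fields would return a dictionary with the keys "status", "title", "date", and "author". META tags outside the HEAD are ignored in this sketch.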
-
-
-
- Known bugs
-
-
-
- * It can't parse the prolog (or whatever you call it) because it does
-
- not know how to ensure that the square brackets match. For example, it
-
- fails on the following:
-
-
-
- <!DOCTYPE HTML [
-
- <!entity % HTML.Minimal "INCLUDE">
-
- <!-- Include standard HTML DTD -->
-
- <!ENTITY % html PUBLIC "-//connolly hal.com//DTD WWW HTML 1.8//EN">
-
- %html;
-
- ]>
-
-
-
- * Font tags (e.g. CODE, EM) cause extra whitespace in the output, e.g.
-
- "<TT>foo</TT>," yields "foo ,".
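
For the first bug, one possible way to skip a prolog is to track the nesting depth of square brackets while ignoring anything inside quoted strings and comments. The sketch below shows that idea only. It is not a fix taken from parse-html.pl, it is written in Python for illustration, and the function name doctype_end is hypothetical.

```python
def doctype_end(text, start=0):
    """Return the index just past a <!DOCTYPE ...> declaration that begins
    at `start`, or `start` unchanged if no declaration begins there.

    Raises ValueError if the declaration is never closed.
    """
    if text[start:start + 9].upper() != "<!DOCTYPE":
        return start
    depth = 0       # nesting depth of square brackets
    quote = None    # the quote character we are currently inside, if any
    i = start + 9
    while i < len(text):
        ch = text[i]
        if quote:
            # Inside a quoted literal: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif text.startswith("<!--", i):
            # Skip a whole comment so brackets inside it are not counted.
            end = text.find("-->", i + 4)
            if end == -1:
                break
            i = end + 3
            continue
        elif ch in "\"'":
            quote = ch
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == ">" and depth == 0:
            # A ">" outside all brackets closes the declaration.
            return i + 1
        i += 1
    raise ValueError("unterminated DOCTYPE declaration")
```

A parser could call doctype_end at the start of the document and begin ordinary tag processing at the returned index. The sketch only finds where the declaration ends and does not interpret the entity declarations inside the brackets.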
-
-